Temporal classification for historical Romanian texts
Abstract
In this paper we look at a task at the border of natural language processing, historical linguistics and the study of language development, namely that of identifying the time when a text was written. We apply machine learning classification with lexical, word-ending and dictionary-based features, using linear support vector machines and random forests. We find that lexical features are the most helpful.
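As a rough illustration of the first two feature types named in the abstract, the sketch below turns a text into relative frequencies of words (lexical features) and of word endings. It is not the authors' implementation: the function names, the `word=`/`end=` feature prefixes, the suffix length of 3 and the tokenization rule are all assumptions made for this example, and dictionary-based features are left out because the abstract does not describe them.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters (digits and punctuation are dropped)."""
    # [^\W\d_] matches any Unicode letter, so Romanian diacritics are kept.
    return re.findall(r"[^\W\d_]+", text.lower())


def extract_features(text, suffix_len=3):
    """Map one text to relative frequencies of its words and word endings.

    suffix_len is an assumed, illustrative value; the abstract does not
    state which ending length was used.
    """
    tokens = tokenize(text)
    counts = Counter()
    for tok in tokens:
        counts["word=" + tok] += 1  # lexical feature
        if len(tok) > suffix_len:
            counts["end=" + tok[-suffix_len:]] += 1  # word-ending feature
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    features = extract_features("Domnul domnului")
    for name in sorted(features):
        print(name, features[name])
```

Feature dictionaries of this shape could then be vectorized and passed to a linear support vector machine or a random forest, for instance scikit-learn's `DictVectorizer` with `LinearSVC` or `RandomForestClassifier`, with the period of each text as the class label.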
Similar resources
Temporal Text Ranking and Automatic Dating of Texts
This paper presents a novel approach to the task of temporal text classification combining text ranking and probability for the automatic dating of historical texts. The method was applied to three historical corpora: an English, a Portuguese and a Romanian corpus. It obtained performance ranging from 83% to 93% accuracy, using a fully automated approach with very basic features.
Temporal Text Classification for Romanian Novels set in the Past
In this paper we look at a task in historical linguistics and the study of language development, namely that of identifying the time when a text was written. The novelty is that we evaluate our classifier and our selected features on literary texts having their action placed in the past and written so as to give off the impression of the respective epoch. We investigate several types of feature...
On the annotation of vague expressions: a case study on Romanian historical texts
Current approaches in Digital Humanities tend to ignore a central aspect of any hermeneutic introspection: the intrinsic vagueness of analyzed texts. Especially when dealing with historical documents, neglecting vagueness has important implications for the interpretation of the results. In this paper we present current limitations of annotation approaches and describe a current methodology for an...
Romanian TimeBank: An Annotated Parallel Corpus for Temporal Information
The paper describes the main steps for the construction, annotation and validation of the Romanian version of the TimeBank corpus. Starting from the English TimeBank corpus – the reference annotated corpus in the temporal domain, we have translated all the 183 English news texts into Romanian and mapped the English annotations onto Romanian, with a success rate of 96.53%. Based on ISO-Time the ...
Stylistic Changes for Temporal Text Classification
This paper investigates stylistic changes in a set of Portuguese historical texts ranging from the 17th to the early 20th century and presents a supervised method to classify them per century. Four stylistic features – average sentence length (ASL), average word length (AWL), lexical density (LD), and lexical richness (LR) – were automatically extracted for each sub-corpus. The initial analysis of ...
Publication date: 2013